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Abstract. Activation Functions introduce non-linearity in the deep neural 
networks. This nonlinearity helps the neural networks learn faster and 
efficiently from the dataset. In deep learning, many activation functions are 
developed and used based on the type of problem statement. ReLU’s variants, 
SWISH, and MISH are goto activation functions. MISH function is considered 
having similar or even better performance than SWISH, and much better than 
ReLU. In this paper, we propose an activation function named APTx which 
behaves similar to MISH, but requires lesser mathematical operations to 
compute. The lesser computational requirements of APTx does speed up the 
model training, and thus also reduces the hardware requirement for the deep 
learning model. 
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1 Introduction 


The ability of deep learning models to learn features directly from the data have made 
it a default approach to solve many complex problems. A simple artificial neuron is 
linear in nature, also expressed in Equation 1. 


y = Ywix, +b (1) 
Here, 
y is output from the neuron 
x;is the input to the neuron 
w;is the associated weights 
b is the associated bias 


When the output of this neuron is passed to an activation function the nonlinearity 
gets introduced in the network. When considering an activation function one 
important thing is that the derivative of an activation function should not be the same 
in its domain. Generally, activation function f is applied to the output of the neurons 
in the hidden layers to make the neural network learn complex features as expressed 
in Equation 2. 


output = f(y) (2) 


The SWISH activation function is considered better than the ReLU function and its 
variants. But, recently developed activation function MISH is considered equivalent 
or even better than SWISH activation function in some cases. 


In this paper, we propose an activation function APTx which behaves similar to the 
MISH activation function but requires lesser mathematical operations. It means lesser 
computation is required in APTx to calculate output in the forward propagation, as a 
result significantly reducing the hardware requirements for training and inference 
phases. The derivative of APTx also has lesser operations than MISH, hence making 
neural networks train faster compared to MISH activation function. 


2 Related Works 


Vinod Nair et al. [1] studied the effect of rectified linear units (ReLU) on Restricted 
Boltzmann Machines. Abien Fred M. Agarap [2] made use of ReLU with 
convolutional neural networks on the MNIST dataset which outperformed the CNN 
with softmax on classification task. Glorot et al. [3] and Sun et al. [4] discussed the 
sparsity of ReLU as a reason for its better performance. Szandala, Tomasz et al. [5] 
performed a comparative analysis showing tanh and sigmoid function both having 
vanishing gradient problems overcome by ReLU, and showing the dying-ReLU 
problem for negative values. Mass et al. [6] presented an improved version of ReLU 
called Leaky-ReLU where instead of having zero value for negative input the function 
will have some negative number output. Clevert et al. [7] proposed an ELU function 
that was faster and better than both ReLU and Leaky-ReLUs. Ramachandran P et al 
[8] presented SWISH activation function having superior performance than ReLU and 
its variants. Misra D. et al. [9] proposed an activation function MISH having similar, 
and in some cases even better performance than SWISH activation function. 


a Proposed APTx activation function 


We are proposing an activation function named as “Alpha Plus Tanh Times” or APTx 
in short. Our APTx function is presented as @ in Equation 3, and its derivative is 
shown in Equation 4. 


(x) = (@ + tanh(Bx)) * yx (3) 
o'(x) = y(a@ + tanh(Px) + Bx sech (B x)) (4) 


By updating the values of the parameters a, 6B, and y we can make the function @ 
behave like a MISH activation function. The updated function @ and it’s derivative is 
shown in Equation 5 and 6, where a= 1, 8 =1 andy ='4 


m(x) = (1 + tanh(x)) * x/2 (5) 
(x) = (1 + tanh(x) + xsech’(x))/2 (6) 
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For the detailed visual analysis of the behavior of our APTx its graph is shown in 
Figure 1, and the graph of its derivative is shown in Figure 2. 


’ 
Figure 1: Graph of our proposed APTx activation function ata =1,6=1andy=' 
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Figure 2: Graph of the derivative of APTx activation function ata =1,8=1 andy='% 


Although one decides activation functions based on the type of the problem statement, 
there are some popular activation functions whose comparisons were already done in 
existing research works. First, we discuss how MISH activation function is better than 
SWISH, ELU, Leaky-ReLU, ReLU, Tanh and Sigmoid activation function for general 
scenarios. Afterwards, we compared the MISH activation function with our proposed 
APTx function. 


4 Comparative analysis of existing activation functions 


The sigmoid activation function is mathematically expressed in Equation 7, and 
comparison of its derivative with the derivative of tanh is shown in Figure 3. One can 
easily notice in Figure 3 that the range of tanh derivatives is larger than sigmoid 
derivatives, but for numbers away from zero both tanh and sigmoid have very less 
output, this introduces the Vanishing Gradient Problem [5] in the larger neural 
networks. 


Sigmoid(x) = 1/1 +e) (7) 


A 
4 —— Derivative of tanh 


-== Derivative of 
Sigmoid 
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Figure 3: Graph showing derivatives of tanh and sigmoid activation functions 


The ReLU activation function provided a solution to the Vanishing Gradient Problem 
at least for the positive inputs [3-4], but for the negative inputs it suffers from the 
Dying-ReLU problem [5], as its derivative for negative value is Zero. Leaky-ReLU 
[6] was able to solve the Dying-ReLU problem upto some extent. ELU [7] showed 
better performance than Leaky-ReLU in most of the tasks as it tends to converge cost 
to zero faster and produce accurate results. For the positive input ReLU, 


5 


Leaky-ReLU, and ELU all behave in the same manner, but the difference lies for the 
non-positive values as shown in Figure 4 and also presented in Equations 8, 9, and 10 
respectively. 


v 
Figure 4: Graph of ReLU, Leaky-ReLU (with @ = 0.05), and ELU (with @ = 2) 


ReLU(x) = max(0, x) (8) 
LeakyReLU(x) = { ax, x <= 0}, and (9) 
{x, x > 0} 
ELU(x) = {a(e— 1), x <= O},and (10) 
hal x > O} 


SWISH activation function [8] performs better than ReLU activation function, and 
also its variants because none of these variants have managed to replace the 
inconsistent gains (i.e. calculation of derivatives). SWISH can be considered a type of 
self-gated function, also expressed in Equation 11. 


SWISH(x) = x * Sigmoid(x) (11) 
Although introduction of SWISH solved both vanishing gradient and providing 
consistent gains, development of MISH activation function [9] turned out to provide 


equivalent and in many tasks it had even better performance than SWISH activation 
function. Its mathematical form is presented in Equation 12. 


MISH(x) = x * tanh(In(1 + e’)) (12) 


Graphs of the derivatives of SWISH and MISH functions are plotted in Figure 5. 


| 


’ 
Figure 5: Graph of the derivatives of SWISH and MISH activation functions 


5 Comparative analysis of MISH with proposed APTx 


During the forward propagation the mathematical operations required to calculate 
APTx expressed in Equation 5 are lesser than the MISH activation function shown in 
Equation 12. But, similar to the MISH function, APTx is bounded below and 
unbounded above. 


The biggest advantage of APTx lies during the training phase while performing 
backpropagation. Back Propagation requires calculation of derivatives for each epoch 
and APTx requires fewer mathematical operations to compute its derivative than the 
MISH activation function. The derivative of MISH is expressed in Equation 13, and 
for comparative analysis the derivative of APTx is stated again in the Equation 14, 
where a= 1,8 =landy=%. 


MISH'(x) = (e*(4(x + 1) + 4e" +e + e*(4x + 6)) )/(2e* te + 2) (13) 


g(x) = (1 + tanh(x) + xsech’(x))/2 (14) 
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Interestingly, despite the fact that the derivative of the APTx function requires fewer 
operations than the derivative of MISH and also SWISH. The derivative graphs of 
APTx and MISH are presented in Figure 6 showing similar behavior for the positive 
domain part, useful for backpropagation. 


--=-= Derivative of MISH 
Derivative of APTx 


Figure 6: Graph of derivatives of MISH and APTx with a= 1,8 =1 andy ='% 


Even more overlapping between MISH, and APTx derivatives can be generated by 
varying values for aw , 6 and y parameters, such asa =1, 8 ='Zandy=' closely 
maps the negative domain part of the APTx with the negative part of MISH. When aw 
= 1, f= 1 andy = % the positive domain part of APTx closely maps with the 
positive part of MISH. 


So, we can use a =1, 6 ='4and y = % values for the negative part, and@w=1,6=1 
and y = 4 for the positive part in case we want to closely approximate the MISH 
activation function. 


Interestingly, APTx function with parameters a = 1, 6 ='2 and y=’ behaves like 
the SWISH(x, 1) activation function, and APTx with a =1, 6 =1 and y = % behaves 
like SWISH(x, 2). 


Our APTx activation function requires lesser computations in forward propagation 
and its derivative also needs lesser computations during backward propagation when 
compared with MISH activation function. 


6 Conclusion 


MISH has similar or even better performance than SWISH which is better than the 
rest of the activation functions. Our proposed activation function APTx behaves 
similar to MISH but requires lesser mathematical operations in calculating value in 
forward propagation, and derivatives in backward propagation. This allows APTx to 
train neural networks faster and be able to run inference on low-end computing 
hardwares such as neural networks deployed on low-end edge-devices with Internet of 
Things. Interestingly, using APTx one can also generate the SWISH(x, P) activation 
function at parameters a=1,6= 9/2 andy=/%. 
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